Learning to Hallucinate Face Images via Component Generation and Enhancement
We propose a two-stage method for face hallucination. First, we generate
facial components of the input image using CNNs. These components represent the
basic facial structures. Second, we synthesize fine-grained facial structures
from high-resolution training images. The details of these structures are
transferred into the facial components for enhancement. In summary, we generate
facial components to approximate ground truth global appearance in the first
stage and enhance them through recovering details in the second stage. The
experiments demonstrate that our method performs favorably against
state-of-the-art methods.
Comment: IJCAI 2017. Project page:
http://www.cs.cityu.edu.hk/~yibisong/ijcai17_sr/index.htm
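The second-stage enhancement can be pictured as transferring high-frequency detail from a matched exemplar onto the CNN-generated component. A minimal NumPy sketch, with a box filter standing in for the paper's detail extraction (all names and the blending weight are illustrative assumptions):

```python
import numpy as np

def box_blur(img, k=3):
    """Crude box filter; stands in for a proper low-pass / Laplacian split."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def enhance_component(component, exemplar, alpha=1.0):
    """Add the exemplar's high-frequency detail to the generated component."""
    detail = exemplar - box_blur(exemplar)   # high-frequency residual
    return component + alpha * detail

rng = np.random.default_rng(0)
component = rng.random((8, 8))   # stage-1 CNN output (coarse structure)
exemplar = rng.random((8, 8))    # matched high-resolution training patch
enhanced = enhance_component(component, exemplar)
```

With a flat exemplar the residual vanishes and the component passes through unchanged, which is the intended behaviour of a pure detail-transfer step.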
Egocentric Hand Detection Via Dynamic Region Growing
Egocentric videos, which mainly record the activities carried out by the
users of wearable cameras, have drawn much research attention in recent
years. Due to their lengthy content, a large number of ego-related applications
have been developed to summarize the captured videos. Since users are
accustomed to interacting with target objects using their own hands, and their
hands usually appear within their visual field during the interaction,
an egocentric hand detection step is involved in tasks like gesture
recognition, action recognition and social interaction understanding. In this
work, we propose a dynamic region growing approach for hand region detection in
egocentric videos, by jointly considering hand-related motion and egocentric
cues. We first determine seed regions that most likely belong to the hand, by
analyzing the motion patterns across successive frames. The hand regions can
then be located by extending from the seed regions, according to the scores
computed for the adjacent superpixels. These scores are derived from four
egocentric cues: contrast, location, position consistency and appearance
continuity. We discuss how to apply the proposed method in real-life scenarios,
where multiple hands irregularly appear and disappear from the videos.
Experimental results on public datasets show that the proposed method achieves
superior performance compared with the state-of-the-art methods, especially in
complicated scenarios.
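The growing step can be sketched as a breadth-first expansion over a superpixel adjacency graph: start from the seed superpixels and absorb neighbours whose combined cue score passes a threshold. A pure-Python sketch (the graph, scores, and threshold below are made up for illustration):

```python
from collections import deque

def grow_hand_region(seeds, adjacency, scores, threshold=0.5):
    """Expand seed superpixels along the adjacency graph.

    seeds:     iterable of seed superpixel ids (high hand likelihood).
    adjacency: dict mapping superpixel id -> list of neighbouring ids.
    scores:    dict mapping superpixel id -> combined egocentric-cue score
               (contrast, location, position consistency, appearance continuity).
    """
    region = set(seeds)
    frontier = deque(seeds)
    while frontier:
        current = frontier.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in region and scores[neighbour] >= threshold:
                region.add(neighbour)
                frontier.append(neighbour)
    return region

# Toy 5-superpixel graph: 0 is the seed; 3 scores below the threshold.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.2, 4: 0.6}
hand = grow_hand_region([0], adjacency, scores)
```

Low-scoring superpixels act as barriers, so growth stops at the hand boundary even when the seed set is small.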
Stylizing Face Images via Multiple Exemplars
We address the problem of transferring the style of a headshot photo to face
images. Existing methods using a single exemplar lead to inaccurate results
when the exemplar does not contain sufficient stylized facial components for a
given photo. In this work, we propose an algorithm to stylize face images using
multiple exemplars containing different subjects in the same style. Patch
correspondences between an input photo and multiple exemplars are established
using a Markov Random Field (MRF), which enables accurate local energy transfer
via Laplacian stacks. As image patches from multiple exemplars are used, the
boundaries of facial components on the target image are inevitably
inconsistent. The artifacts are removed by a post-processing step using an
edge-preserving filter. Experimental results show that the proposed algorithm
consistently produces visually pleasing results.
Comment: In CVIU 2017. Project page:
http://www.cs.cityu.edu.hk/~yibisong/cviu17/index.htm
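Ignoring the MRF's pairwise smoothness term, the data term of the correspondence step amounts to picking, for each input patch, the closest patch over all exemplars. A NumPy nearest-neighbour sketch (the real method adds MRF pairwise terms and Laplacian-stack transfer; everything here is a simplified stand-in):

```python
import numpy as np

def best_matches(input_patches, exemplar_patches):
    """For each input patch, return (exemplar_id, patch_id) of the nearest patch.

    input_patches:    array of shape (n, d) of flattened patches.
    exemplar_patches: list of arrays, one (m_i, d) array per exemplar.
    """
    matches = []
    for patch in input_patches:
        best = (None, None, np.inf)
        for ex_id, patches in enumerate(exemplar_patches):
            dists = np.sum((patches - patch) ** 2, axis=1)  # unary energy
            idx = int(np.argmin(dists))
            if dists[idx] < best[2]:
                best = (ex_id, idx, dists[idx])
        matches.append(best[:2])
    return matches

inputs = np.array([[0.0, 0.0], [1.0, 1.0]])
exemplars = [np.array([[0.1, 0.1]]),
             np.array([[0.9, 1.1], [5.0, 5.0]])]
matches = best_matches(inputs, exemplars)
```

Because each input patch may match a different exemplar, component boundaries can disagree, which is exactly why the paper needs the edge-preserving post-processing step.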
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the learned features are merely
frame-based and thus not applicable to many video analytics tasks where
spatio-temporal features prevail. In this paper we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzle-style tasks that are hard even for humans
to solve, the proposed approach is consistent with inherent human visual habits
and therefore easy to answer. We conduct extensive experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas.
Comment: CVPR 2019.
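The motion-statistics labels can be illustrated on a block of optical flow: find the spatial block with the largest motion magnitude (the fast-motion region) and quantize its mean flow direction into one of several bins. A NumPy sketch under an assumed 2D flow input (grid layout and bin count are illustrative, not the paper's exact recipe):

```python
import numpy as np

def motion_statistics(flow, grid=2, n_bins=8):
    """Return (fastest block index, dominant direction bin) from a flow field.

    flow: array (H, W, 2) of per-pixel (dx, dy) optical flow.
    The frame is split into a grid x grid layout of blocks.
    """
    h, w, _ = flow.shape
    bh, bw = h // grid, w // grid
    mags, dirs = [], []
    for i in range(grid):
        for j in range(grid):
            block = flow[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            mags.append(np.linalg.norm(block, axis=-1).mean())
            mean_dx, mean_dy = block[..., 0].mean(), block[..., 1].mean()
            angle = np.arctan2(mean_dy, mean_dx) % (2 * np.pi)
            dirs.append(int(angle / (2 * np.pi) * n_bins))
    fastest = int(np.argmax(mags))
    return fastest, dirs[fastest]

flow = np.zeros((8, 8, 2))
flow[0:4, 4:8] = [1.0, 0.0]          # top-right block moves right
fastest, direction = motion_statistics(flow)
```

These cheap statistical targets are what the network regresses, so no human annotation is ever needed.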
SINet: A Scale-insensitive Convolutional Neural Network for Fast Vehicle Detection
Vision-based vehicle detection approaches have achieved remarkable success in
recent years with the development of deep convolutional neural networks (CNNs).
However, existing CNN-based algorithms suffer from the problem that
convolutional features are scale-sensitive for object detection, while traffic
images and videos commonly contain vehicles with a large variance of scales.
In this paper, we delve into the source of scale sensitivity, and
reveal two key issues: 1) existing RoI pooling destroys the structure of
small-scale objects, and 2) the large intra-class distance induced by a large
variance of scales
exceeds the representation capability of a single network. Based on these
findings, we present a scale-insensitive convolutional neural network (SINet)
for the fast detection of vehicles with a large variance of scales. First, we present
a context-aware RoI pooling to maintain the contextual information and original
structure of small-scale objects. Second, we present a multi-branch decision
network to minimize the intra-class distance of features. These lightweight
techniques add no extra time cost yet bring a prominent improvement in
detection accuracy. The proposed techniques can be plugged into any deep
network architecture while keeping it trainable end-to-end. Our SINet achieves
state-of-the-art performance in terms of accuracy and speed (up to 37 FPS) on
the KITTI benchmark and a new highway dataset, which contains a large variance
of scales and extremely small objects.
Comment: Accepted by IEEE Transactions on Intelligent Transportation Systems (T-ITS).
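The intuition behind the first fix is that naive pooling collapses RoIs smaller than the output grid; one simplified stand-in for the paper's deconvolution-based context-aware RoI pooling is to upsample small RoIs instead of pooling them. A NumPy sketch (nearest-neighbour upsampling and max pooling are assumptions, not the exact operators):

```python
import numpy as np

def roi_to_grid(roi, out=7):
    """Map an RoI feature patch to a fixed out x out grid.

    Large RoIs are max-pooled per cell; RoIs smaller than the grid are
    upsampled (nearest neighbour here; the paper uses deconvolution) so the
    structure of small objects is not destroyed.
    """
    h, w = roi.shape
    if h < out or w < out:
        ys = np.arange(out) * h // out
        xs = np.arange(out) * w // out
        return roi[np.ix_(ys, xs)]
    pooled = np.zeros((out, out))
    ys = np.linspace(0, h, out + 1).astype(int)
    xs = np.linspace(0, w, out + 1).astype(int)
    for i in range(out):
        for j in range(out):
            pooled[i, j] = roi[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

small = np.arange(9, dtype=float).reshape(3, 3)     # 3x3 RoI of a tiny vehicle
large = np.arange(196, dtype=float).reshape(14, 14)
g_small = roi_to_grid(small)
g_large = roi_to_grid(large)
```

The branch on RoI size is the whole point: a 3x3 vehicle keeps all nine of its feature values instead of being flattened into a few pooled cells.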
Context-aware and Scale-insensitive Temporal Repetition Counting
Temporal repetition counting aims to estimate the number of cycles of a given
repetitive action. Existing deep learning methods assume repetitive actions are
performed in a fixed time-scale, which is invalid for the complex repetitive
actions in real life. In this paper, we tailor a context-aware and
scale-insensitive framework, to tackle the challenges in repetition counting
caused by the unknown and diverse cycle-lengths. Our approach combines two key
insights: (1) Cycle lengths of different actions are unpredictable and require
large-scale searching; however, once a coarse cycle length is determined, the
variation between repetitions can be handled by regression. (2) Determining the
cycle length cannot rely only on a short video fragment but requires contextual
understanding. The first point is implemented by a coarse-to-fine cycle
refinement method. It avoids the heavy computation of exhaustively searching
all the cycle lengths in the video, and, instead, it propagates the coarse
prediction for further refinement in a hierarchical manner. Second, we propose
a bidirectional cycle length estimation method for a context-aware prediction.
It is a regression network that takes two consecutive coarse cycles as input,
and predicts the locations of the previous and next repetitive cycles. To
benefit training and evaluation in the area of temporal repetition counting, we
construct a new benchmark, the largest to date, which contains 526 videos with diverse
repetitive actions. Extensive experiments show that the proposed network
trained on a single dataset outperforms state-of-the-art methods on several
benchmarks, indicating that the proposed framework is general enough to capture
repetition patterns across domains.
Comment: Accepted by CVPR 2020.
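The coarse-to-fine idea can be sketched with a self-similarity score: evaluate only a sparse set of candidate cycle lengths, then refine locally around the best coarse guess instead of exhaustively testing every length. A pure-Python sketch with a toy periodicity score (all names and the scoring function are hypothetical):

```python
def period_score(seq, cycle):
    """Toy periodicity score: how well seq matches itself shifted by `cycle`."""
    pairs = list(zip(seq, seq[cycle:]))
    if not pairs:
        return float("-inf")
    return -sum((a - b) ** 2 for a, b in pairs) / len(pairs)

def coarse_to_fine_cycle(seq, coarse_step=4, radius=3):
    """Coarse search over sparse cycle lengths, then local refinement."""
    candidates = range(coarse_step, len(seq) // 2 + 1, coarse_step)
    coarse = max(candidates, key=lambda c: period_score(seq, c))
    fine = range(max(2, coarse - radius), coarse + radius + 1)
    return max(fine, key=lambda c: period_score(seq, c))

seq = [0, 1, 2, 1, 0] * 6           # period-5 repetitive signal, 30 frames
cycle = coarse_to_fine_cycle(seq)
```

The coarse pass touches only a handful of candidate lengths, and the refinement pass recovers the exact period from the best coarse neighbourhood, mirroring the hierarchical propagation in the paper.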
RIGID: Recurrent GAN Inversion and Editing of Real Face Videos
GAN inversion is indispensable for applying the powerful editability of GAN
to real images. However, existing methods invert video frames individually,
often leading to undesired inconsistent results over time. In this paper, we
propose a unified recurrent framework, named \textbf{R}ecurrent v\textbf{I}deo
\textbf{G}AN \textbf{I}nversion and e\textbf{D}iting (RIGID), to explicitly and
simultaneously enforce temporally coherent GAN inversion and facial editing of
real videos. Our approach models the temporal relations between current and
previous frames from three aspects. To enable a faithful real video
reconstruction, we first maximize the inversion fidelity and consistency by
learning a temporally compensated latent code. Second, we observe that
incoherent noise lies in the high-frequency domain and can be disentangled from
the latent space. Third, to remove the inconsistency after attribute manipulation,
we propose an \textit{in-between frame composition constraint} such that an
arbitrary frame must be a direct composite of its neighboring frames. Our
unified framework learns the inherent coherence between input frames in an
end-to-end manner, and therefore it is agnostic to a specific attribute and can
be applied to arbitrary editing of the same video without re-training.
Extensive experiments demonstrate that RIGID outperforms state-of-the-art
methods qualitatively and quantitatively in both inversion and editing tasks.
The deliverables can be found at \url{https://cnnlstm.github.io/RIGID}.
Comment: ICCV 2023.
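The in-between composition constraint can be written as a reconstruction loss that forces frame t to be explained as a blend of its neighbours. A NumPy sketch (the blend mask here is a fixed scalar, whereas the paper learns the composition):

```python
import numpy as np

def composition_loss(prev_frame, frame, next_frame, mask):
    """Penalize frames that cannot be composed from their neighbours.

    mask in [0, 1] weights the contribution of the previous vs. next frame.
    """
    composite = mask * prev_frame + (1.0 - mask) * next_frame
    return float(np.mean((frame - composite) ** 2))

prev_frame = np.zeros((4, 4))
next_frame = np.ones((4, 4))
mid_frame = np.full((4, 4), 0.5)    # a temporally coherent in-between frame
loss = composition_loss(prev_frame, mid_frame, next_frame, mask=0.5)
```

A coherent in-between frame incurs zero loss, while a frame that jumps ahead of its neighbours is penalized, which is how the constraint discourages flicker after editing.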
Shunted Self-Attention via Multi-Scale Token Aggregation
Recent Vision Transformer~(ViT) models have demonstrated encouraging results
across various computer vision tasks, thanks to their competence in modeling
long-range dependencies of image patches or tokens via self-attention. These
models, however, usually assign a similar receptive field to each token
feature within each layer. Such a constraint inevitably limits the ability of
each self-attention layer to capture multi-scale features, thereby leading to
performance degradation in handling images with multiple objects of different
scales. To address this issue, we propose a novel and generic strategy, termed
shunted self-attention~(SSA), that allows ViTs to model the attentions at
hybrid scales per attention layer. The key idea of SSA is to inject
heterogeneous receptive field sizes into tokens: before computing the
self-attention matrix, it selectively merges tokens to represent larger object
features while keeping certain tokens to preserve fine-grained features. This
novel merging scheme enables the self-attention to learn relationships between
objects of different sizes and simultaneously reduces the number of tokens and
the computational cost. Extensive experiments across various tasks demonstrate
the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0\%
Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on
ImageNet with only half of the model size and computation cost, and surpasses
Focal Transformer by 1.3 mAP on COCO and 2.9 mIOU on ADE20K under similar
parameter and computation cost. Code has been released at
https://github.com/OliverRensu/Shunted-Transformer
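The core mechanism, per-head token merging for keys and values, can be sketched in NumPy: each head average-pools the K/V tokens at its own rate before standard scaled dot-product attention, while queries keep full resolution. This is an illustrative single-layer sketch without learned projections, not the released implementation:

```python
import numpy as np

def merge_tokens(x, rate):
    """Average-pool consecutive tokens: (n, d) -> (n // rate, d)."""
    n, d = x.shape
    return x[: n - n % rate].reshape(n // rate, rate, d).mean(axis=1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def shunted_attention(x, rates=(1, 2, 4)):
    """One head per merge rate: coarser K/V tokens give larger receptive fields."""
    outputs = []
    for rate in rates:
        kv = merge_tokens(x, rate)                  # fewer, coarser tokens
        attn = softmax(x @ kv.T / np.sqrt(x.shape[1]))
        outputs.append(attn @ kv)                   # (n, d) per head
    return np.concatenate(outputs, axis=-1)         # heads concatenated

rng = np.random.default_rng(0)
tokens = rng.random((8, 16))                        # 8 tokens, dim 16
out = shunted_attention(tokens)
```

The rate-4 head attends over only 2 merged tokens while the rate-1 head sees all 8, so one layer mixes coarse object-level context with fine-grained detail, and the attention matrices shrink with the merge rate.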
Deformable Object Tracking with Gated Fusion
The tracking-by-detection framework has received growing attention through its
integration with Convolutional Neural Networks (CNNs). Existing
tracking-by-detection based methods, however, fail to track objects with severe
appearance variations. This is because the traditional convolutional operation
is performed on fixed grids, and thus may not be able to find the correct
response when the object is changing pose or under varying environmental
conditions. In this paper, we propose a deformable convolution layer to enrich
the target appearance representations in the tracking-by-detection framework.
We aim to capture the target appearance variations via deformable convolution,
which adaptively enhances its original features. In addition, we also propose a
gated fusion scheme to control how the variations captured by the deformable
convolution affect the original appearance. The enriched feature representation
through deformable convolution facilitates the discrimination of the CNN
classifier on the target object and background. Extensive experiments on the
standard benchmarks show that the proposed tracker performs favorably against
state-of-the-art methods.
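The gated fusion itself reduces to a per-element convex combination of the original and deformation-enhanced features. A NumPy sketch of the fusion rule (the gate values would come from a learned layer; fixed logits are used here for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(original, deformed, gate_logits):
    """Blend original and deformable-conv features with a per-element gate."""
    gate = sigmoid(gate_logits)            # squashed into (0, 1)
    return gate * deformed + (1.0 - gate) * original

original = np.zeros((2, 2))
deformed = np.ones((2, 2))
closed = gated_fusion(original, deformed, np.full((2, 2), -100.0))  # gate ~ 0
opened = gated_fusion(original, deformed, np.full((2, 2), 100.0))   # gate ~ 1
```

When the gate saturates low the tracker falls back to the original appearance features, and when it saturates high the deformation-enhanced features dominate; intermediate gates interpolate smoothly between the two.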
Fine-grained Domain Adaptive Crowd Counting via Point-derived Segmentation
Due to domain shift, a large performance drop is usually observed when a
trained crowd counting model is deployed in the wild. While existing
domain-adaptive crowd counting methods achieve promising results, they
typically regard each crowd image as a whole and reduce domain discrepancies in
a holistic manner, thus limiting further improvement of domain adaptation
performance. To this end, we propose to untangle \emph{domain-invariant} crowd
and \emph{domain-specific} background from crowd images and design a
fine-grained domain adaptation method for crowd counting. Specifically, to
disentangle crowd from background, we propose to learn crowd segmentation from
point-level crowd counting annotations in a weakly-supervised manner. Based on
the derived segmentation, we design a crowd-aware domain adaptation mechanism
consisting of two crowd-aware adaptation modules, i.e., Crowd Region Transfer
(CRT) and Crowd Density Alignment (CDA). The CRT module is designed to guide
the transfer of crowd features across domains beyond background distractions.
The CDA module is dedicated to regularising target-domain crowd density
generation with the target domain's own crowd density distribution. Our method
outperforms previous approaches
consistently in the widely-used adaptation scenarios.
Comment: 10 pages, 5 figures, and 9 tables.
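Deriving a crowd mask from point-level counting annotations can be pictured as stamping a small disc around each annotated head point. A NumPy sketch (the paper learns the segmentation weakly-supervised rather than using a fixed radius, which is assumed here for illustration):

```python
import numpy as np

def points_to_mask(points, shape, radius=2):
    """Turn head-point annotations into a binary crowd-region mask."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    mask = np.zeros(shape, dtype=bool)
    for py, px in points:
        mask |= (ys - py) ** 2 + (xs - px) ** 2 <= radius ** 2
    return mask

points = [(3, 3), (3, 8)]               # two annotated heads in a 10x12 image
mask = points_to_mask(points, (10, 12))
crowd_pixels = int(mask.sum())
```

The resulting mask is what separates domain-invariant crowd regions from domain-specific background, so the adaptation modules can operate on crowd pixels only.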